Mining Linguistically Interpreted Texts
Abstract
This paper proposes and evaluates the use of linguistic information in the pre-processing phase of text mining tasks. We present several experiments comparing our proposal, which selects terms based on linguistic knowledge, with the usual techniques applied in the field. The results show that part-of-speech information is useful in the pre-processing phase of text categorization and clustering, as an alternative to stop-word removal and stemming.
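The idea of selecting terms by part of speech can be illustrated with a minimal sketch. It assumes the tokens have already been tagged by some part-of-speech tagger and that tags follow a Universal-POS-style label set; the set of kept categories (`KEPT_TAGS`) and the function names are hypothetical choices made for illustration, not the selection used in the paper.

```python
from collections import Counter

# Assumption: which grammatical categories to keep. The paper's actual
# choice of categories is not stated in this abstract.
KEPT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}


def select_terms(tagged_tokens, kept_tags=KEPT_TAGS):
    """Return lowercased words whose POS tag is in kept_tags.

    tagged_tokens is a list of (word, tag) pairs produced by any tagger.
    """
    return [word.lower() for word, tag in tagged_tokens if tag in kept_tags]


def term_frequencies(tagged_tokens, kept_tags=KEPT_TAGS):
    """Build a bag-of-words vector from the selected terms only."""
    return Counter(select_terms(tagged_tokens, kept_tags))


# Hand-tagged example sentence (tags assigned manually for illustration).
tagged = [
    ("The", "DET"), ("results", "NOUN"), ("show", "VERB"),
    ("that", "SCONJ"), ("linguistic", "ADJ"), ("information", "NOUN"),
    ("is", "AUX"), ("useful", "ADJ"),
]

print(select_terms(tagged))
print(term_frequencies(tagged, {"NOUN"}))
```

In this sketch, function words such as determiners and auxiliaries drop out because of their tags rather than because they appear on a stop-word list, and no stemming step is applied.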
Similar resources
Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank
In the field of Human Language Technology (HLT), the existence of linguistically interpreted real-world texts provides the license necessary for a given language to enter the area of high-tech applications. The significance of BulTreeBank is the granting of an HLT license to a “less processed” language like Bulgarian which, until recently, has been formally modelled and processed mainly on the ...
Pattern Mining with Natural Language Processing: An Exploratory Approach
Pattern mining derives from the need to discover hidden knowledge in very large amounts of data, regardless of the form in which it is presented. Natural Language Processing (NLP), for its part, arose from humans' need to be understood by computers. In this paper we present an exploratory approach that aims at bringing together the best of both worlds. Our goal is to disco...
Multilayer model for Arabic text compression
This article describes a multilayer model-based approach for text compression. It uses linguistic information to develop a multilayer decomposition model of the text in order to achieve better compression. This new approach is illustrated for the case of the Arabic language, where the majority of words are generated according to the Semitic root-and-pattern scheme. Text is split into three ling...
Back to the Roots of Genres: Text Classification by Language Function
The term “genre” covers different aspects of both texts and documents, and it has led to many classification schemes. This makes different approaches to genre identification incomparable and the task itself unclear. We introduce the linguistically motivated text classification task language function analysis, LFA, which focuses on one well-defined aspect of genres. The aim of LFA is to determin...
Bridging the Gap between Domain-Oriented and Linguistically-Oriented Semantics
This paper compares domain-oriented and linguistically-oriented semantics, based on the GENIA event corpus and FrameNet. While the domain-oriented semantic structures are direct targets of Text Mining (TM), their extraction from text is not straightforward due to the diversity of linguistic expressions. The extraction of linguistically-oriented semantics is more straightforward, and has been stud...